bash command
Cyber-Zero: Training Cybersecurity Agents without Runtime
Zhuo, Terry Yue, Wang, Dingmin, Ding, Hantian, Kumar, Varun, Wang, Zijian
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
LLM-Supported Natural Language to Bash Translation
Westenfelder, Finnian, Hemberg, Erik, Tulla, Miguel, Moskal, Stephen, O'Reilly, Una-May, Chiricescu, Silviu
The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. Our code is available at https://github.com/westenfelder/NL2SH
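The idea of combining command execution with an LLM's evaluation of the outputs can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `run`, `functionally_equivalent`, and `whitespace_judge` are hypothetical, and the `judge` callable merely stands in for an LLM call, which the abstract does not specify. Commands should only be executed inside a disposable sandbox.

```python
import subprocess
from typing import Callable


def run(cmd: str, timeout: float = 5.0) -> str:
    """Run cmd through the system shell and return its standard output."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout


def functionally_equivalent(
    cmd_a: str, cmd_b: str, judge: Callable[[str, str], bool]
) -> bool:
    """Execute both commands; accept identical output, otherwise ask the judge."""
    out_a, out_b = run(cmd_a), run(cmd_b)
    if out_a == out_b:
        return True
    # Assumption: in a real system this would be an LLM deciding whether the
    # two outputs reflect the same behavior. Here it is any callable.
    return judge(out_a, out_b)


def whitespace_judge(out_a: str, out_b: str) -> bool:
    """Hypothetical stand-in for the LLM judge: ignore whitespace differences."""
    return out_a.split() == out_b.split()


if __name__ == "__main__":
    print(functionally_equivalent("echo hello", "printf 'hello\\n'", whitespace_judge))
    print(functionally_equivalent("echo a", "echo b", whitespace_judge))
```

Passing the judge in as a parameter keeps the execution step testable without network access; swapping in a real model call would not change the control flow.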
CodeR: Issue Resolving with Multi-Agent and Task Graphs
Chen, Dong, Lin, Shaoxin, Zeng, Muhan, Zan, Daoguang, Wang, Jian-Gang, Cheshkov, Anton, Sun, Jun, Yu, Hao, Dong, Guoliang, Aliev, Artem, Wang, Jie, Cheng, Xiao, Liang, Guangtai, Ma, Yuchi, Bian, Pan, Xie, Tao, Wang, Qianxiang
The rapidly growing capability of Large Language Models (LLMs) is dramatically reshaping many industries [2, 3, 4]. The most recent release of GPT-4o [5] demonstrates a significant leap in multi-modal capabilities and artificial intelligence (AI)-human interaction, whilst maintaining the same level of text generation, reasoning, and code intelligence as GPT-4-Turbo [6]. LLMs can interact with humans and the world as humans do, which is considered a starting point for LLMs to take over tasks from humans or collaborate naturally with them. Issue resolving is one of the software engineering tasks explored with LLMs that is particularly relevant in practice. SWE-bench [1] collects 2,294 real-world issues from 12 popular Python libraries.
Tackling Execution-Based Evaluation for NL2Bash
Vo, Ngoc Phuoc An, Paulovicks, Brent, Sheinin, Vadim
Given the recent advancement of Large Language Models (LLMs), the task of translating natural language prompts into different programming languages (code generation) has attracted immense attention for its wide application in different domains. In particular, code generation for Bash (NL2Bash) is widely used to generate Bash scripts for automating tasks such as performance monitoring, compilation, system administration, and system diagnostics. Besides code generation, validating synthetic code is critical before using it in any application. Different methods for code validation have been proposed, both direct (execution evaluation) and indirect (e.g., exact/partial match, BLEU score). Among these, Execution-based Evaluation (EE) validates the predicted code by comparing the execution output of the model prediction with the expected output in the system. However, designing and implementing such an execution-based evaluation system for NL2Bash is not a trivial task. In this paper, we present a machinery for execution-based evaluation for NL2Bash. We create a set of 50 prompts to evaluate some popular LLMs for NL2Bash. We also analyze several advantages and challenges of EE, such as syntactically different yet semantically equivalent Bash scripts generated by different LLMs, or syntactically correct but semantically incorrect Bash scripts, and how we capture and process them correctly.
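A small sketch of execution-based evaluation, under stated assumptions, may make the comparison step concrete. The `Case` type, the `execute` and `evaluate` helpers, and the example scripts are all hypothetical and are not taken from the paper; scripts are run with `sh -c` for portability, and anything like this should only run in an isolated environment.

```python
import subprocess
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Case:
    prompt: str            # natural language task description
    expected_output: str   # output a correct script should print


def execute(script: str, timeout: float = 5.0) -> str:
    """Run a script with `sh -c` and return its stdout, stripped of outer whitespace."""
    try:
        proc = subprocess.run(
            ["sh", "-c", script], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ""  # a script that hangs simply yields no output here
    return proc.stdout.strip()


def evaluate(cases: Sequence[Case], predictions: Sequence[str]) -> float:
    """Return the fraction of predicted scripts whose output matches the expectation."""
    passed = sum(
        execute(pred) == case.expected_output.strip()
        for case, pred in zip(cases, predictions)
    )
    return passed / len(cases)


if __name__ == "__main__":
    case = Case("Count the lines in the two-line text 'a' and 'b'", "2")
    predictions = [
        "printf 'a\\nb\\n' | wc -l",       # one correct formulation
        "printf 'a\\nb\\n' | grep -c ''",  # different syntax, same result
        "printf 'a\\nb\\n' | wc -c",       # valid syntax, wrong semantics
    ]
    print(evaluate([case] * len(predictions), predictions))
```

Comparing outputs rather than script text is what lets the first two predictions both pass while the third, which counts bytes instead of lines, fails.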
Evaluating Language-Model Agents on Realistic Autonomous Tasks
Kinniment, Megan, Sato, Lucas Jun Koba, Du, Haoxing, Goodrich, Brian, Hasin, Max, Chan, Lawrence, Miles, Luke Harold, Lin, Tao R., Wijk, Hjalmar, Burget, Joel, Ho, Aaron, Barnes, Elizabeth, Christiano, Paul
In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ARA may be useful for informing measures around security, monitoring, and alignment. Additionally, once a system is capable of ARA, placing bounds on a system's capabilities may become significantly more difficult. We construct four simple example agents that combine language models with tools that allow them to take actions in the world. We then evaluate these agents on 12 tasks relevant to ARA. We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. Unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ARA. In particular, we do not think that these evaluations provide good assurance that the "next generation" of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ARA, unless intermediate evaluations are performed during pretraining. Relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ARA.
NL2CMD: An Updated Workflow for Natural Language to Bash Commands Translation
Fu, Quchen, Teng, Zhongwei, Georgaklis, Marco, White, Jules, Schmidt, Douglas C.
Translating natural language into Bash Commands is an emerging research field that has gained attention in recent years. Most efforts have focused on producing more accurate translation models. To the best of our knowledge, only two datasets are available, with one based on the other. Both datasets involve scraping known data sources (platforms like Stack Overflow, crowdsourcing, etc.) and hiring experts to validate and correct either the English text or Bash Commands. This paper provides two contributions to research on synthesizing Bash Commands from scratch. First, we describe a state-of-the-art translation model used to generate Bash Commands from the corresponding English text. Second, we introduce a new NL2CMD dataset that is automatically generated, involves minimal human intervention, and is over six times larger than prior datasets. Since the generation pipeline does not rely on existing Bash Commands, the distribution and types of commands can be custom adjusted. We evaluate the performance of ChatGPT on this task and discuss the potential of using it as a data generator. Our empirical results show how the scale and diversity of our dataset can offer unique opportunities for semantic parsing researchers.
Train YOLO for Object Detection on a Custom Dataset using Python
I recently started working in the field of computer vision. And in these early days, I'm studying how the various algorithms of object detection work. Among the most well-known ones are R-CNN, Fast R-CNN, Faster R-CNN and of course YOLO. In this article, I want to focus on the last mentioned algorithm. YOLO is the state of the art in object detection and there are endless use cases where YOLO can be used.
11 Reasons To Learn Bash (A.K.A. Command Line)
But it's not just a skill for software devs -- learning bash can be valuable for anyone who works with data. In short, Bash is the Unix command-line interface (CLI). You'll also see it called the terminal, the command line, or the shell. It's a command language that allows us to work with files on our computers in a way that's far more efficient and powerful than using a GUI (graphical user interface). Making the switch from graphical user interfaces (GUIs) to a command-line interface can feel overwhelming.
OpenAI-Powered Linux Shell
This is a basic Python shell (really, it's a fancy wrapper over the system shell) that takes a task and asks OpenAI for what Linux bash command to run based on your description. For safety reasons, you can look at the command and cancel before actually running it. To be clear, I'm not trying to convince you that having an AI model figure out what Linux command to run based on your written description is a good idea, but the commands that it generates are, well - watch the video if you want to see. There are several pre-canned ways of interacting with the models that OpenAI provides (the "GPT" models): completing a provided fragment, answering a question, generating "ideas" from a topic, summarizing a passage, etc. This shell uses the question-and-answer format and provides the model with an "example context" and examples of input and output.
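The flow described here (build a question-and-answer prompt from an example context and sample pairs, ask the model for a command, then let the user cancel before anything runs) might look roughly like the sketch below. It is an illustration under assumptions, not the project's code: `CONTEXT`, `EXAMPLES`, `build_prompt`, and `run_task` are hypothetical names, and `ask_model` is a placeholder callable rather than a real OpenAI client call.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical example context and few-shot pairs; the real project's differ.
CONTEXT = "Translate each task into a single Linux bash command."
EXAMPLES = [
    ("list files in the current directory", "ls"),
    ("show the current date", "date"),
]


def build_prompt(context: str, examples: Sequence[Tuple[str, str]], task: str) -> str:
    """Assemble a question-and-answer style prompt ending with the open question."""
    lines = [context, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}"]
    lines += [f"Q: {task}", "A:"]
    return "\n".join(lines)


def run_task(
    task: str,
    ask_model: Callable[[str], str],
    confirm: Callable[[str], bool],
) -> Optional[str]:
    """Ask the model for a command and run it only if the user confirms."""
    command = ask_model(build_prompt(CONTEXT, EXAMPLES, task)).strip()
    if not confirm(command):
        return None  # user cancelled, nothing is executed
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout


if __name__ == "__main__":
    # A canned "model" and an interactive confirmation, for demonstration only.
    fake_model = lambda prompt: "echo hello from the shell"
    ask_user = lambda cmd: input(f"Run `{cmd}`? [y/N] ").lower() == "y"
    print(run_task("print a greeting", fake_model, ask_user))
```

Keeping the confirmation step as a separate callable means the safety check cannot be skipped by the code path that executes the command.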
UMAP clustering in Python
The aim of this short Python tutorial is to introduce the uniform manifold approximation and projection (UMAP) algorithm, using 76,533 single-cell expression profiles from the human primary motor cortex. The data are available from the Cell Types database, which is part of the Allen Brain Map platform. UMAP has quickly established itself as a go-to clustering tool, well poised to expand our knowledge of many things, including the human brain. I hope by the end of this tutorial you will have a broad understanding of the UMAP algorithm and how to implement it. Uniform manifold approximation and projection (UMAP) [1] is a scalable and efficient dimension reduction algorithm that performs competitively among state-of-the-art methods such as t-SNE [2], and is widely applied for unsupervised clustering.